An improved chromosome-level genome assembly and annotation of Echeneis naucrates

Echeneis naucrates, also known as the live sharksucker, is famous for attaching to hosts using a highly modified dorsal fin that forms an oval-shaped sucking disc. Here, we generated an improved high-quality chromosome-level genome assembly of E. naucrates using Illumina short reads, PacBio long reads and Hi-C data. Our assembled genome spans 572.85 Mb with a contig N50 of 23.19 Mb and is anchored to 24 pseudo-chromosomes. Additionally, at least one telomere was identified for 23 out of 24 chromosomes. Furthermore, we identified a total of 22,161 protein-coding genes, of which 21,402 genes (96.9%) were successfully assigned functional annotations. The combination of ab initio predictions and Repbase-based searches revealed that 15.57% of the assembled E. naucrates genome consists of repetitive sequences. The completeness of the genome assembly and the gene annotation were estimated to be 97.5% and 95.4%, respectively, with BUSCO analyses. This work enhances the utility of the live sharksucker genome and provides valuable groundwork for future studies of genomics, biology and adaptive evolution in this species.


Background & Summary
The live sharksucker (Echeneis naucrates), also known as the sluggard of the ocean, belongs to the family Echeneidae, order Carangiformes (Fig. 1). This sharksucker is widely found in tropical and warm temperate waters 1, ranging from coastal to offshore areas 2. Its key distinguishing characteristic is the oval-shaped sucking disc, a highly modified dorsal fin used to attach to hosts. The disc comprises 21-28 laminae and extends from the top of the head to the front part of the body 3. The hosts of the live sharksucker include whales, sharks, dolphins, sea turtles, divers and vessel hulls 4-7. Proposed benefits of host attachment for the live sharksucker include conveyance (via "hitchhiking"), protection from predators, enhanced courtship and reproductive capacity, improved gill aeration and expanded feeding opportunities 8. The unique suction disc and attachment behavior make the live sharksucker a good research subject for bionic studies 9,10, fishing aids 11 and adaptive evolution, such as the commensal relationship between remoras and sharks 12. Nonetheless, our understanding of the biology of the live sharksucker remains limited.
Genome sequencing has played a pivotal role in advancing various aspects of basic biology. High-quality reference genomes could profoundly enhance our understanding of the genetic foundation and the evolutionary processes underlying the unique biological characteristics of the live sharksucker. Although a chromosome-level live sharksucker genome has been released on NCBI under GenBank assembly accessions GCA_900963305.1 13,14 and GCA_900963305.2 15, the completeness of the genome assembly and annotations still requires further refinement. For instance, the released chromosome-level genome assembly remains incomplete, with many gaps (average 110.13 N's per 100 kbp) (Fig. 3b). Moreover, a number of annotation details, including information on repeats and non-coding RNAs, have not been made publicly available.
In this study, we generated 33.14 Gb of PacBio high-fidelity (HiFi) long reads with an N50 length of 18.11 kb, and 89.93 Gb of Illumina paired-end short reads for genome assembly (Table 1). An additional 76.64 Gb of high-throughput chromatin conformation capture (Hi-C) sequencing data were utilized to validate the genome assembly through a comparison with the scaffolding data. Leveraging these integrated sequencing data, we constructed a high-quality chromosome-level reference genome of E. naucrates. Specifically, a 572.85 Mb genome was assembled, comprising 54 contigs with a contig N50 length of 23.19 Mb. A total of 570.71 Mb (99.63% of the contig-level genome) of the assembled sequences were anchored to 24 pseudo-chromosomes with few missing bases (average 0.40 N's per 100 kbp). Moreover, telomeres were identified for at least one end of 23 out of 24 chromosomes, totaling 38 telomeres (Fig. 3a and Table 7). In this enhanced genome assembly, we improved upon previous gene annotations by combining ab initio predictions, protein homology searches and transcriptome-assisted methods, which identified a total of 22,161 protein-coding genes. Through a dual approach involving both homology searches and ab initio predictions, 15.57% of the assembled E. naucrates genome was identified as repetitive sequences. BUSCO analysis of the assembly based on the actinopterygii_odb10 database revealed that our final assembly contained 3,551 (97.5%) complete BUSCOs. The consensus QV of the genome assembly was 52.01. In summary, this high-quality chromosome-level reference genome serves as a valuable foundation for the utilization of genetic resources and for further investigation of the unique biological characteristics of the live sharksucker, such as the oval-shaped sucking disc.
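To put the consensus QV into perspective, the following minimal Python sketch converts a Phred-scaled quality value into a per-base error probability using the standard relation QV = -10 log10(E). The function names are illustrative and are not part of any tool used in this study.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value into a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Inverse conversion: per-base error probability to Phred-scaled QV."""
    return -10 * math.log10(error_rate)


# Consensus QV reported for the assembly in the text.
error = qv_to_error_rate(52.01)
print(f"per-base error probability: {error:.2e}")
print(f"approximately one error every {1 / error:,.0f} bases")
```

Under this relation, a QV of 52.01 corresponds to an error probability on the order of 6 × 10^-6 per base, i.e. roughly one consensus error per 160 kb.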

Methods
Sample collection and preparation. A single fish (~1500 g) was obtained in June 2022 from the northern South China Sea. The sampling in this study was approved by the Animal Care and Use Committee of the Fisheries College of Jimei University (Animal Ethics no. 1067) and performed in accordance with the regulations and guidelines established by this committee. Dorsal muscle, dorsal fin, skin, skull, and skull muscle tissues were collected and preserved in liquid nitrogen until the extraction of DNA and RNA. Dorsal muscle tissue was used for DNA sequencing to construct the genome assembly, while all tissues were used for RNA sequencing. The quality and quantity of genomic DNA samples were assessed through 1% agarose gel electrophoresis and the Pultton DNA/Protein Analyzer (Plextech).

WGS Illumina library construction, sequencing and assembly.
To create the whole-genome sequencing (WGS) Illumina library, a paired-end library was constructed with an insert size of 300 bp following the Illumina standard protocol. Then, DNA was purified, quantified, and sequenced from both ends using the Illumina NovaSeq 6000 sequencing platform. In total, 89.93 Gb of raw reads were obtained (Table 1). After filtering with fastp v 0.23.2 16 using default parameters to remove low-quality and short reads and to trim adapters and polyG sequences, 87.77 Gb of clean data were retained (Table 1). The genome size and heterozygosity of the live sharksucker were then estimated by k-mer analysis of the clean Illumina short reads using GCE v 1.0.0 17 with default settings.
PacBio library construction, sequencing and assembly. To obtain the PacBio long reads, a SMRTbell library was constructed with a fragment size of 20 kb using the SMRTbell template preparation kit 1.0 (PacBio) in accordance with the manufacturer's instructions. The library was sequenced with the PacBio Sequel II system in Circular Consensus Sequence (CCS) mode. After the elimination of low-quality reads, 33.14 Gb of reads with an average length of 17.90 kb were retained and then processed with the CCS v 6.0.0 algorithm with default parameters. With these PacBio long reads, the initial contigs were assembled using Hifiasm v 0.16.1 18 with default settings. After that, purge_haplotigs v1.0.4 19 with the parameters '-a 70 -j 80 -d 200' was employed to eliminate redundant sequences. This procedure resulted in a contig-level assembly of about 588.30 Mb comprising 54 contigs, with an N50 and maximum contig size of 23.19 Mb and 29.49 Mb, respectively.
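As an illustration of how a contig N50 such as the one reported above is derived, the following minimal Python sketch computes N50 from a list of contig lengths. It is a generic illustration with toy values, not the routine used by the assembly or evaluation software in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths (arbitrary units), not data from this study.
toy_contigs = [10, 8, 6, 4, 2]
print(n50(toy_contigs))  # total = 30; 10 + 8 = 18 >= 15, so N50 = 8
```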
Hi-C library preparation, sequencing and chromosome assembly. Hi-C data were used to anchor contigs onto chromosomes. Briefly, dorsal muscle tissue (~1 g) of E. naucrates was fixed with 1% formaldehyde for 10-30 min at room temperature (20-25 °C) to cross-link proteins involved in chromatin interactions within the genome. DNA was digested with the 4-cutter restriction enzyme MboI. The overhangs of the restriction fragments were filled in and labeled with biotinylated nucleotides, followed by ligation in a small volume. Following cross-link reversal, the ligated DNA was purified and fragmented to a size range of 300-500 bp. Subsequently, ligation junctions were pulled down with streptavidin beads and prepared for Illumina NovaSeq 6000 sequencing. In total, 76.64 Gb of Hi-C reads were obtained (Table 1). After filtering out reads with average quality scores below 20 and removing adapters using fastp v 0.23.2 16 with default settings, a total of 76.56 Gb of clean data were retained (Table 1). We also used the HiCUP pipeline 20, with the parameter '--re1 ^GATC,MboI' in the hicup_digester step, to remove erroneous mappings and duplicated contigs and to yield the interaction matrix. This matrix served as the foundation for anchoring the contigs onto chromosomes, using approximately 169.29 million read pairs (~68.27%), via the 3D-DNA pipeline 21 with default settings. The scaffolds were then manually assessed and refined with Juicebox Assembly Tools 22 to correct any chromosome translocations and inversions. By integrating the Hi-C data, the contig-level assembled sequences were anchored onto 24 pseudo-chromosomes with a cumulative length of 570.71 Mb, covering ~99.63% of the contig-level genome (Fig. 2).
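The '^GATC' notation above marks the MboI cut position immediately before the recognition sequence. As a minimal sketch of what an in silico digestion involves, the Python example below splits a sequence at every GATC site. It is an illustration on a toy sequence and does not reproduce the hicup_digester implementation; the function names are hypothetical.

```python
import re

# MboI recognition sequence; the enzyme cuts immediately before the G (^GATC).
MBOI_SITE = "GATC"


def mboi_cut_positions(sequence):
    """Return 0-based cut positions, one at the start of each GATC site."""
    return [match.start() for match in re.finditer(MBOI_SITE, sequence.upper())]


def digest(sequence):
    """Split a sequence into restriction fragments at the MboI cut positions."""
    bounds = [0] + mboi_cut_positions(sequence) + [len(sequence)]
    # Skip empty fragments, which arise when the sequence starts with a site.
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy sequence, not taken from the E. naucrates genome.
fragments = digest("AAGATCTTTGATCGG")
print(fragments)  # ['AA', 'GATCTTT', 'GATCGG']
```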
RNA library construction and transcriptome sequencing. Total RNA was extracted from five tissues of the live sharksucker (dorsal muscle, dorsal fin, skin, skull, and skull muscle) using TRIzol reagent (Invitrogen). RNA quality was assessed with both a NanoDrop ND-1000 spectrophotometer (Labtech) and a 2100 Bioanalyzer (Agilent Technologies). Paired-end sequencing was performed on the NovaSeq 6000 platform. In total, 33.01 Gb of clean data were generated from the RNA-seq library after filtering with fastp v 0.23.2 16 using default parameters (Table 1).

Repetitive sequence annotation.
Repeat elements within the live sharksucker genome were comprehensively identified through a dual approach involving both homology searches and ab initio predictions. The ab initio prediction of repeat elements was performed using both Tandem Repeats Finder v 4.09 23 and LTR_FINDER_parallel v1.1 23 with default parameters. Subsequently, novel repeats were predicted using RepeatMasker with the de novo repeat library constructed with LTR_FINDER_parallel and RepeatModeler v 2.0 24 under default parameters. To identify known repeat elements, RepeatMasker v 4.0.9 25 and RepeatProteinMask v 4.1.0 (http://www.repeatmasker.org) were run with default parameters, querying the genome sequences against the Repbase database 26. The integration of ab initio predictions and Repbase-based searches revealed that 15.57% of the assembled E. naucrates genome consists of repetitive sequences (Fig. 4). Among these, repetitive DNAs, LINEs, SINEs and LTRs covered 5.74%, 4.03%, 2.27% and 1.85% of the entire genome, respectively (Table 3).
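When annotations from several repeat-finding approaches are combined, overlapping hits need to be counted only once for the genome-wide repeat fraction to be meaningful. The following minimal Python sketch shows one way to do this by merging intervals. The coordinates are hypothetical, and the procedure is an assumption for illustration rather than the exact integration step used in this study.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals.

    Overlapping or nested intervals are merged so that each base
    is counted only once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before opening a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical repeat annotations on a 1,000 bp toy sequence.
ab_initio_hits = [(0, 100), (250, 400)]
homology_hits = [(50, 150), (380, 420)]
toy_sequence_length = 1000

union = merged_length(ab_initio_hits + homology_hits)
print(union / toy_sequence_length)  # merged blocks: 0-150 and 250-420
```

In this toy case, simply summing the four hit lengths would give 390 bp, whereas the merged coverage is 320 bp.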

Gene prediction and annotation.
Using the repeat-masked genome, protein-coding genes within the live sharksucker genome were predicted through three strategies: ab initio predictions, protein homology searches and transcriptome-assisted methods.

Fig. 2 Hi-C interaction heat map for the genome assembly of E. naucrates. The interaction density is quantified based on the number of supporting Hi-C reads and depicted using a color gradient ranging from white (low density) to dark red (high density).

Identification of telomeres. Based on the characteristic telomeric repeat of vertebrates (CCCTAA/TTAGGG), telomere sequences were identified by pattern searching at both ends of each chromosome, requiring the characteristic sequence to repeat at least four times within a 50 kb region. In total, 38 telomeres were annotated across 23 chromosomes, with no telomere sequence detected on chr7 (Fig. 3a and Table 7).
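The pattern search described above can be sketched in Python as follows. This is a minimal illustration, not the script used in this study. It assumes that the four or more copies of the motif must be consecutive, which the description does not state explicitly, and the function name and `window` parameter are hypothetical.

```python
import re

WINDOW = 50_000   # terminal region searched at each chromosome end (from the text)
MIN_REPEATS = 4   # minimum number of motif copies (from the text)

# Assumption: the copies must occur in tandem (back to back).
TELOMERE = re.compile(
    r"(?:CCCTAA){%d,}|(?:TTAGGG){%d,}" % (MIN_REPEATS, MIN_REPEATS)
)


def telomere_ends(chromosome, window=WINDOW):
    """Report whether a telomeric repeat run is found near each chromosome end."""
    seq = chromosome.upper()
    return {
        "start": TELOMERE.search(seq[:window]) is not None,
        "end": TELOMERE.search(seq[-window:]) is not None,
    }


# Toy chromosome with a telomeric run at the start only; a small window
# is used because the toy sequence is far shorter than 50 kb.
toy_chromosome = "CCCTAA" * 5 + "ACGT" * 20
print(telomere_ends(toy_chromosome, window=40))
```

For the toy sequence, the search reports a telomere at the start but not at the end, analogous to a chromosome with a single annotated telomere.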

Data Records
The raw sequencing data of E. naucrates generated in this study can be accessed from the Sequence Read Archive (SRA) under SRP457893 49, including WGS Illumina sequencing data (SRR25859131), PacBio HiFi sequencing data (SRR25859130) and Hi-C sequencing data (SRR25859129). The assembled genome of E. naucrates was deposited at GenBank under accession GCA_031770045.1 50. Furthermore, files of the assembled genome, protein-coding gene annotation, non-coding RNA prediction and repeat annotation of E. naucrates were deposited in the Figshare database 51.

Technical Validation
Assessing the quality of the genome assembly. We initially used QUAST v 5.2.0 52 to evaluate the integrity and quality of the E. naucrates genome assembly. The contig N50 (the contig length such that contigs of this length or longer contain at least half of the total assembled sequence) reached 23.19 Mb, substantially surpassing the 12.4 Mb of previous E. naucrates genome versions (GenBank assembly accessions: GCA_900963305.1, GCA_900963305.2). Furthermore, the genome in this study exhibits an exceptionally low gap count (average 0.40 N's per 100 kbp) (Table 2; Fig. 3a), a substantial reduction compared with the average of 110.13 N's per 100 kbp in the previous versions (Fig. 3b). Next, we remapped the Illumina paired-end clean reads and PacBio long reads to the final assembled genome using BWA 53 and Minimap2 54, resulting in mapping rates of 99.62% and 99.98%, respectively. The homozygous SNP rate was 0.00% when the Illumina paired-end clean reads were aligned to the final assembly, underscoring the accuracy of the assembled genome (Table 8). Furthermore, the completeness of the assembled genome sequence was assessed with Benchmarking Universal Single-Copy Orthologs (BUSCO, v 5.1.0) 55 based on the actinopterygii_odb10 database. The BUSCO analysis of the assembly identified 3,551 (97.5%) complete orthologs, including 3,514 (96.5%) single-copy and 37 (1.0%) duplicated orthologs, as well as 14 (0.4%) fragmented orthologs (Table 9). The consensus quality value (QV) of the assembly, estimated using Merqury 56 (k-mer = 21), was 52.01.
Assessing the quality of the genome annotation. The BUSCO analysis of the annotation based on the actinopterygii_odb10 database, which was used to assess the integrity of the annotated gene set, identified 3,473 (95.4%) complete genes, comprising 3,437 (94.4%) single-copy and 36 (1.0%) duplicated genes, along with 46 (1.3%) fragmented genes (Table 9). Taken together, these assessments indicate that the present E. naucrates genome surpasses the existing public E. naucrates genomes in quality.

Fig. 3 Comparison of the genome assembly of E. naucrates with the previous version. Contig distribution maps for chromosomes of E. naucrates in (a) the assembly from this study and (b) the previous version. The grey bars represent the entire lengths of the chromosomes, on which the positions of telomeres are shown. The contig numbers and the sizes of the chromosomes are shown next to the bars.

Table 1. Statistics of sequencing data for E. naucrates genome assembly and annotation.

Table 2. Comparison of E. naucrates genome assembly metrics with previous version.

Table 3. Statistics on transposable elements in E. naucrates genome.

Table 5. Summary of functional annotations for predicted genes.

Table 6. Statistics of ncRNA in E. naucrates genome.

Table 9. Statistics of BUSCO assessment.